library(tidyverse)
library(readxl)
library(janitor)
The Excel file read in this example is analytic_data.xlsx. Replace this with the name of your Excel file.
In this R Markdown file, the data frame is called EXAMPLE_DATA. Replace this with the name of the data frame you wish to use.
EXAMPLE_DATA <- read_excel("analytic_data.xlsx")
# Convert every character column to a factor
EXAMPLE_DATA <- EXAMPLE_DATA %>%
  mutate(across(where(is.character), as.factor))
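If your workbook has more than one sheet, it can help to list the sheets before reading. The sketch below is not part of the original template: it uses the sample workbook that ships with readxl (located with readxl_example()) so that it runs without your own file, and CHICK_DATA is a hypothetical name chosen for illustration. It also shows across(where(is.character), as.factor), which is the current dplyr idiom for the same character-to-factor conversion.

```r
library(tidyverse)
library(readxl)

# Path to a sample workbook installed with readxl (stands in for your own file)
example_path <- readxl_example("datasets.xlsx")

# List the sheet names so you can choose which one to read
excel_sheets(example_path)

# Read one sheet by name; `sheet` also accepts a position such as 3
CHICK_DATA <- read_excel(example_path, sheet = "chickwts")

# Convert every character column to a factor
CHICK_DATA <- CHICK_DATA %>%
  mutate(across(where(is.character), as.factor))

# Check the column types
glimpse(CHICK_DATA)
```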
In all of the code below, replace EXAMPLE_DATA with the name of your data frame and use your own variable names. Remember to save any changes by assigning the result to a data frame, either overwriting the old one or creating a new one. Here we create a new one called UPDATED_DATA.
The convention with rename is new = old.
There is also a built-in function in the janitor package called clean_names, which is an easy way to clean a series of names at once.
# Rename selected variables (new = old)
UPDATED_DATA <- EXAMPLE_DATA %>%
  rename(NEW_NAME_CAT1 = CATEGORICAL_VARIABLE1,
         NEW_NAME_CAT2 = CATEGORICAL_VARIABLE2)
# Clean all variable names at once
UPDATED_DATA <- EXAMPLE_DATA %>%
  clean_names()
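To see what clean_names does in practice, here is a small sketch on made-up data. MESSY_DATA and its column names are hypothetical and exist only for illustration. By default clean_names converts names to lower-case snake_case, after which rename can be used for any remaining adjustments.

```r
library(tidyverse)
library(janitor)

# Hypothetical data frame with awkward column names
MESSY_DATA <- tibble(
  `First Name` = c("Ann", "Bob"),
  `AGE (years)` = c(34, 51),
  `test.score` = c(7.5, 8.1)
)

# Clean every name in one step
CLEAN_DATA <- MESSY_DATA %>%
  clean_names()

# Inspect the cleaned names
names(CLEAN_DATA)

# Follow up with rename (new = old) for individual changes
RENAMED_DATA <- CLEAN_DATA %>%
  rename(name = first_name)
```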
We can select columns by name, by column number, or by characteristics such as text contained in the name.
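UPDATED_DATA <- EXAMPLE_DATA %>%
  # Keep columns whose names contain "VARIABLE"
  select(contains("VARIABLE")) %>%
  # Keep a range of adjacent columns by name
  select(CATEGORICAL_VARIABLE1:NUMERICAL_VARIABLE3) %>%
  # Keep columns by position (positions refer to the columns remaining at this step)
  select(c(1, 4:5))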
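The selection approaches above can be tried on a small made-up data frame. TOY_DATA and its columns are hypothetical. The sketch also shows two helpers not used above, starts_with and where, as well as dropping a column with a minus sign.

```r
library(tidyverse)

# Hypothetical data for illustration
TOY_DATA <- tibble(
  ID = 1:3,
  GROUP = c("A", "B", "A"),
  SCORE_PRE = c(10.2, 11.5, 9.8),
  SCORE_POST = c(12.1, 11.9, 10.4)
)

# Select by a characteristic of the name
BY_PREFIX <- TOY_DATA %>% select(starts_with("SCORE"))

# Select by a characteristic of the column contents
BY_TYPE <- TOY_DATA %>% select(where(is.numeric))

# Drop a column by name
WITHOUT_ID <- TOY_DATA %>% select(-ID)

# Select by column number
BY_POSITION <- TOY_DATA %>% select(1, 3:4)
```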
We can use filter to select rows based on a condition (or series of conditions) and slice to select by row number.
UPDATED_DATA <- EXAMPLE_DATA %>%
  filter(CATEGORICAL_VARIABLE1 == "A") %>%
  slice(1:5)
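As a sketch of filtering on a series of conditions, the example below uses a made-up data frame (TOY_DATA, GROUP and SCORE are hypothetical names). Conditions separated by commas must all be true, %in% matches any of several values, and is.na is the usual way to handle missing values in a filter.

```r
library(tidyverse)

# Hypothetical data for illustration
TOY_DATA <- tibble(
  GROUP = c("A", "B", "C", "A", "B"),
  SCORE = c(4.1, 7.3, NA, 8.8, 2.5)
)

# Comma-separated conditions are combined with AND
A_HIGH <- TOY_DATA %>% filter(GROUP == "A", SCORE > 5)

# Match any of several values
A_OR_B <- TOY_DATA %>% filter(GROUP %in% c("A", "B"))

# Keep only rows where SCORE is not missing
NOT_MISSING <- TOY_DATA %>% filter(!is.na(SCORE))

# Keep the first two rows by position
FIRST_TWO <- TOY_DATA %>% slice_head(n = 2)
```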
Creating new variables may include transforming numerical variables or recoding factors. A range of functions for working with factors can be found in the forcats package.
UPDATED_DATA <- EXAMPLE_DATA %>%
  mutate(
    # Log of the sum of two numerical variables
    NEW_NUMERICAL_VARIABLE = log(NUMERICAL_VARIABLE1 + NUMERICAL_VARIABLE2),
    # Recode factor levels (new = old)
    NEW_CATEGORICAL_VARIABLE2 = fct_recode(CATEGORICAL_VARIABLE2, "Yes" = "Y", "No" = "N"))
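The following sketch repeats the same idea on made-up data and adds two steps that often go with recoding: reordering factor levels with fct_relevel and creating a categorical variable from a numerical one with if_else. TOY_DATA and its variables are hypothetical.

```r
library(tidyverse)

# Hypothetical data for illustration
TOY_DATA <- tibble(
  ANSWER = factor(c("Y", "N", "Y", "N")),
  HEIGHT_CM = c(150, 165, 172, 180)
)

UPDATED <- TOY_DATA %>%
  mutate(
    # Transform a numerical variable
    HEIGHT_M = HEIGHT_CM / 100,
    # Recode factor levels (new = old)
    ANSWER_LABEL = fct_recode(ANSWER, "Yes" = "Y", "No" = "N"),
    # Move "Yes" to be the first level
    ANSWER_LABEL = fct_relevel(ANSWER_LABEL, "Yes"),
    # Derive a categorical variable from a numerical one
    HEIGHT_GROUP = if_else(HEIGHT_CM >= 170, "Tall", "Not tall")
  )

# Check the level order after recoding and releveling
levels(UPDATED$ANSWER_LABEL)
```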
Values can be set to missing using R's built-in missing data code, NA.
# Replace the value "N" with NA
UPDATED_DATA <- EXAMPLE_DATA %>%
  mutate(NEW_CATEGORICAL_VARIABLE2 = na_if(CATEGORICAL_VARIABLE2, "N"))
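A common use of na_if is to convert placeholder codes to NA. The sketch below assumes, for illustration only, that "unknown" and -99 were used as missing-value placeholders; TOY_DATA and its variables are hypothetical. It then counts the missing values in each column and, optionally, drops incomplete rows with drop_na from tidyr.

```r
library(tidyverse)

# Hypothetical data where "unknown" and -99 stand in for missing values
TOY_DATA <- tibble(
  RESPONSE = c("Y", "N", "unknown", "Y"),
  AGE = c(34, -99, 51, 28)
)

# Convert the placeholder codes to NA
UPDATED <- TOY_DATA %>%
  mutate(
    RESPONSE = na_if(RESPONSE, "unknown"),
    AGE = na_if(AGE, -99)
  )

# Count missing values in each column
MISSING_COUNTS <- UPDATED %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Keep only rows with no missing values
COMPLETE <- UPDATED %>% drop_na()
```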
# Mean of each numerical variable, ignoring missing values
UPDATED_DATA <- EXAMPLE_DATA %>%
  summarise(across(NUMERICAL_VARIABLE1:NUMERICAL_VARIABLE3, ~ mean(.x, na.rm = TRUE)))
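The summary above gives one value per variable for the whole data frame. To summarise within categories instead, group_by can be added before summarise. This is a minimal sketch on made-up data; TOY_DATA, GROUP and SCORE are hypothetical names.

```r
library(tidyverse)

# Hypothetical data for illustration
TOY_DATA <- tibble(
  GROUP = c("A", "A", "B", "B"),
  SCORE = c(2, 4, 10, NA)
)

SUMMARY <- TOY_DATA %>%
  group_by(GROUP) %>%
  summarise(
    N = n(),                                 # rows per group, including missing
    MEAN_SCORE = mean(SCORE, na.rm = TRUE),  # mean ignoring missing values
    .groups = "drop"                         # return an ungrouped data frame
  )

SUMMARY
```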
A second data set with a common variable (TIME_VARIABLE) has been created in order to demonstrate a range of ways to join data sets.
SECOND_EXAMPLE_DATA <- tribble(
  ~TIME_VARIABLE, ~NEW_DATA,
  1, 1.6,
  2, 1.7,
  3, 1.8,
  4, 1.9,
  5, 2.1,
  9, 3.2,
  10, 4.1,
  11, 4.6,
  14, 4.9,
  15, 6.5,
  16, 6.7,
  17, 7.9,
  18, 10.1,
  19, 14.6,
  20, 20.8,
  21, 20.9,
  22, 24.6,
  23, 30.1,
  24, 31.3,
)
# Keep all rows of EXAMPLE_DATA
UPDATED_DATA <- left_join(EXAMPLE_DATA, SECOND_EXAMPLE_DATA, by = "TIME_VARIABLE")
# Keep all rows of SECOND_EXAMPLE_DATA
UPDATED_DATA <- right_join(EXAMPLE_DATA, SECOND_EXAMPLE_DATA, by = "TIME_VARIABLE")
# Keep all rows of both data sets
UPDATED_DATA <- full_join(EXAMPLE_DATA, SECOND_EXAMPLE_DATA, by = "TIME_VARIABLE")
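The difference between the joins is easiest to see by comparing row counts on two small data sets whose keys only partly overlap. FIRST_DATA and SECOND_DATA below are hypothetical. The sketch also includes inner_join and anti_join, which are not used above but can be handy for checking which rows did or did not find a match.

```r
library(tidyverse)

# Hypothetical data sets sharing TIME_VARIABLE; times 2 and 3 appear in both
FIRST_DATA <- tibble(TIME_VARIABLE = c(1, 2, 3), VALUE = c(10, 20, 30))
SECOND_DATA <- tibble(TIME_VARIABLE = c(2, 3, 4), NEW_DATA = c(1.7, 1.8, 1.9))

# All rows of FIRST_DATA; NEW_DATA is NA where there is no match
LEFT <- left_join(FIRST_DATA, SECOND_DATA, by = "TIME_VARIABLE")

# All rows of SECOND_DATA; VALUE is NA where there is no match
RIGHT <- right_join(FIRST_DATA, SECOND_DATA, by = "TIME_VARIABLE")

# All rows of both data sets
FULL <- full_join(FIRST_DATA, SECOND_DATA, by = "TIME_VARIABLE")

# Only rows with a match in both data sets
INNER <- inner_join(FIRST_DATA, SECOND_DATA, by = "TIME_VARIABLE")

# Rows of FIRST_DATA with no match in SECOND_DATA
UNMATCHED <- anti_join(FIRST_DATA, SECOND_DATA, by = "TIME_VARIABLE")

# Compare the number of rows produced by each join
c(left = nrow(LEFT), right = nrow(RIGHT), full = nrow(FULL), inner = nrow(INNER))
```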
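# Reshape to wide format: one column per level of CATEGORICAL_VARIABLE1
UPDATED_DATA <- EXAMPLE_DATA %>%
  # rep(1:10, 2) has 20 values, so this assumes EXAMPLE_DATA has 20 rows
  mutate(NEW_ID = rep(1:10, 2)) %>%
  select(NEW_ID, CATEGORICAL_VARIABLE1, NUMERICAL_VARIABLE1) %>%
  pivot_wider(names_from = CATEGORICAL_VARIABLE1, values_from = NUMERICAL_VARIABLE1)
# Reshape back to long format: columns A to B become name and value pairs
UPDATED_DATA <- UPDATED_DATA %>%
  pivot_longer(A:B,
               names_to = "CATEGORICAL_VARIABLE1",
               values_to = "NUMERICAL_VARIABLE1")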
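The reshaping steps above can be checked with a round trip on a small made-up data set: pivoting to wide format and back to long format should return the original values. LONG_DATA and its variables are hypothetical.

```r
library(tidyverse)

# Hypothetical long-format data: one row per ID and GROUP
LONG_DATA <- tibble(
  ID = c(1, 1, 2, 2),
  GROUP = c("A", "B", "A", "B"),
  SCORE = c(5.1, 6.2, 4.8, 7.0)
)

# Wide format: one row per ID, one column per level of GROUP
WIDE_DATA <- LONG_DATA %>%
  pivot_wider(names_from = GROUP, values_from = SCORE)

# Back to long format: columns A to B become GROUP and SCORE
ROUND_TRIP <- WIDE_DATA %>%
  pivot_longer(A:B, names_to = "GROUP", values_to = "SCORE")

# Compare the round trip with the original data
all.equal(as.data.frame(LONG_DATA), as.data.frame(ROUND_TRIP))
```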
© Statistical Consulting Centre, University of Melbourne, 2023